llama-3.2-3B-instruct
Overall Accuracy: 18.7%
Answer Key: claude-opus-4-5-20251101
Boundary Models: 20
Pairs: 190
Total Rollouts: 950
Max Turns: 5
Question Difficulty Distribution
Too Easy: 47.1% (447)
Calibrated: 18.7% (178)
Too Hard: 34.2% (325)
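The summary figures above can be cross-checked with a little arithmetic. The sketch below is only a consistency check, not a description of how the dashboard computes its numbers. It assumes that "Pairs" counts unordered pairs of distinct boundary models and that every pair received the same number of rollouts; neither assumption is stated on the page. The variable names are illustrative.

```python
from math import comb

# Figures copied from the dashboard summary.
boundary_models = 20
pairs = 190
total_rollouts = 950
buckets = {"Too Easy": 447, "Calibrated": 178, "Too Hard": 325}

# Assumption: "Pairs" means unordered pairs of distinct boundary models.
expected_pairs = comb(boundary_models, 2)

# Assumption: rollouts are spread evenly across pairs.
rollouts_per_pair = total_rollouts / pairs

# Share of each difficulty bucket, rounded to one decimal place as on the page.
shares = {name: round(100 * n / total_rollouts, 1) for name, n in buckets.items()}

print("pairs from 20 models:", expected_pairs)
print("rollouts per pair:", rollouts_per_pair)
print("bucket total:", sum(buckets.values()))
print("bucket shares (%):", shares)
```

Under these assumptions the numbers agree with each other: 20 models give 190 pairs, 950 rollouts over 190 pairs is 5 per pair, and the three bucket counts sum to 950 and reproduce the listed percentages. The Calibrated share equals the reported Overall Accuracy of 18.7%, although the page does not say whether the two are the same quantity by definition.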
Pairwise Accuracy Matrix
Conversation Explorer
Boundary Models (the same 20 options are offered for Boundary Model A and Boundary Model B)
claude-haiku
claude-sonnet
claude-opus
llama-3.2-1b
llama-3.2-3b
gemma-3-4b-it
gemma-3-12b-it
gemma-3-27b-it
qwen3-1.7b
qwen3-4b
qwen3-8b
qwen3-14b
qwen3-4b-instruct
qwen3-30b-instruct
olmo3-7b
olmo3-32b
mistral-3-3b
mistral-3-8b
mistral-3-14b
gpt-oss-20b